A Dynamical Central Limit Theorem for Shallow Neural Networks

Neural Information Processing Systems

Recent theoretical work has characterized the dynamics and convergence properties of wide shallow neural networks trained via gradient descent; the asymptotic regime in which the number of parameters tends to infinity has been dubbed the mean-field limit. At initialization, the randomly sampled parameters lead to a deviation from the mean-field limit that is dictated by the classical central limit theorem (CLT). However, the dynamics of training introduces correlations among the parameters, raising the question of how the fluctuations evolve during training. Here, we analyze the mean-field dynamics as a Wasserstein gradient flow and prove that, in the width-asymptotic limit, the deviations from the mean-field evolution scaled by the width remain bounded throughout training. This observation has implications for both the approximation rate and generalization: the upper bound we obtain is controlled by a Monte-Carlo-type resampling error, which importantly does not depend on dimension. We also relate the bound on the fluctuations to the total variation norm of the measure to which the dynamics converges, which in turn controls the generalization error.
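As a schematic illustration of the setting (the notation below is assumed for exposition and is not fixed by the abstract): a width-$n$ shallow network is an empirical average of features $\psi(x,\theta_i)$ over its parameters, the mean-field limit replaces this average by an integral against a parameter measure $\mu_t$ evolving under a Wasserstein gradient flow, and the object of study is the CLT-scaled fluctuation:

$$
f_{n,t}(x) = \frac{1}{n}\sum_{i=1}^{n} \psi\bigl(x,\theta_i(t)\bigr), \qquad
\bar f_t(x) = \int \psi(x,\theta)\,\mathrm{d}\mu_t(\theta), \qquad
g_{n,t} = \sqrt{n}\,\bigl(f_{n,t}-\bar f_t\bigr).
$$

At initialization the classical CLT makes $g_{n,0}$ of order one; the result described above is that this scaled deviation stays bounded along the training dynamics, giving a Monte-Carlo-type rate $\lVert f_{n,t}-\bar f_t\rVert = O(n^{-1/2})$ whose constant does not depend on the input dimension.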


Review for NeurIPS paper: A Dynamical Central Limit Theorem for Shallow Neural Networks

Neural Information Processing Systems

Weaknesses: Proposition 2.1 is tangential and not new in content or proof technique; much the same was shown in, e.g., [Mei, Misiakiewicz, Montanari '19] and other works building upon it. The proofs of Propositions 3.1 and 3.2, the most meaningful results, are simple calculations making use of the Mean Value Theorem and Duhamel's principle, respectively. Theorem 3.3 is a lot of work for what is not a particularly interesting result: it is asymptotic in both n and t, so it yields no insight into the dynamics, nor into any relationship between n and t. Moreover, it is not truly dimension-free as the authors claim; dimension implicitly shows up in the moments of f, given that \psi is positively homogeneous, so it is rather the case that dimension enters the variance bound as one might expect. Moreover (and this is made clearer by examining the experimental results), it is not useful to reason about optimization time (finite or asymptotic) without reference to a discretization scheme. The experiments refer to "epochs", but there is no optimization algorithm to relate to the flows discussed in the theory.
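For readers unfamiliar with the tool the review points to, the following is the standard statement of Duhamel's principle (a generic sketch, not the paper's specific derivation): if a fluctuation $g_t$ obeys a linear evolution with a forcing term,

$$
\partial_t g_t = A_t\, g_t + h_t, \qquad
g_t = \Phi_{t,0}\, g_0 + \int_0^t \Phi_{t,s}\, h_s\,\mathrm{d}s,
$$

where $\Phi_{t,s}$ is the propagator of the homogeneous equation ($\partial_t \Phi_{t,s} = A_t\,\Phi_{t,s}$, $\Phi_{s,s}=\mathrm{Id}$), then bounds on $\Phi$ and $h$ yield, via a Grönwall-type argument, control of $g_t$ over time.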


Review for NeurIPS paper: A Dynamical Central Limit Theorem for Shallow Neural Networks

Neural Information Processing Systems

The paper provides CLT-like results for the dynamics of wide, single-hidden-layer neural networks in the mean-field limit. The authors also show that under certain conditions the long-time fluctuations can be controlled by a Monte-Carlo-type resampling error. The reviewers had a positive assessment of the finite-width analysis and the strength of some of the technical contributions. They did, however, raise a variety of concerns regarding the asymptotic nature of the results (both in n and t), the assumptions on Dhat, and the lack of results under discretization. While some of these concerns were alleviated by the authors' response, the more critical reviewers maintained their scores and one positive reviewer slightly decreased theirs from 8 to 7. I agree with the reviewers that CLT-type results for finite width are indeed interesting.

